Abstract

Traditional performance debugging and tuning of parallel programs is based on the
"measure-modify" approach, in which detailed measurements of program executions are used to
guide incremental changes to the program that result in better performance. Unfortunately, the
performance of a parallel algorithm is often related to its implementation, input data, and
machine characteristics in surprising ways, and the "measure-modify" approach is unsuited
to exploring these relationships fully: it is too heavily dependent on experimentation and
measurement, which is impractical for studying the large number of variables that can
affect parallel program performance. In this paper we argue that the problem of selecting the
best implementation of a parallel algorithm requires a new approach to parallel program
performance evaluation, one with a greater balance between measurement and modeling.
We first present examples that demonstrate that different parallelizations of a program may
be necessary to achieve the best possible performance as one varies the input data, machine
architecture, or number of processors used. We then present an approach to performance
evaluation based on lost cycles analysis, which involves measurement and modeling of all
sources of overhead in a parallel program. We describe a measurement tool for lost cycles
analysis that we have incorporated into the runtime environment for Fortran programs on
the Kendall Square KSR1, and use this tool to analyze the performance tradeoffs among
implementations of 2D FFT and parallel subgraph isomorphism. Using these examples, we
show how lost cycles analysis can be used to solve the problems associated with selecting
the best implementation in a variable environment. In addition, we show that this approach
can capture large amounts of performance data using only a small number of measurements,
and that it is flexible enough to allow conclusions to be drawn from empirical data in some
cases, and analytic results in other cases.

Order No. 8S30). Mark Crovella is supported by an ARPA Research Assistantship in Performance
Computing administered by the Institute for Advanced Computer Studies, University of Maryland.

1 Introduction

The performance of a parallel algorithm is often related to its implementation, input data,
and machine characteristics in surprising ways. Most programmers, however, do not experiment
with a wide variety of alternative implementations or data inputs when tuning an application.
This lack of experimentation is due primarily to the difficulty of restructuring
a parallel application, and the enormous number of possible implementations and inputs.
Recent work on programming languages and environments is designed to address the first
problem, by allowing the user to easily implement multiple parallelizations within the same
program framework [Alverson and Notkin, 1992; Coffin, 1990; Crowl and LeBlanc, 1991;
Halstead, 1985]. Nonetheless, the sheer size of the parameter space effectively limits the
extent of experimentation.

In this paper we argue that understanding the behavior of parallel programs as a function
of implementation and input represents a performance evaluation problem that is important
and yet fundamentally different from the traditional view of parallel performance tuning.
Given that multiple alternative implementations of a parallel program may be needed to
achieve the best possible performance as one varies the input, the machine characteristics,
the number of processors used, and other aspects of the execution environment, how does
the programmer sort through the many possible implementations and determine the
circumstances under which each performs best? More specifically, how does the programmer
predict crossover points at which one implementation outperforms another?

In this paper we describe a methodology and associated tools for solving these problems.
Our approach is based on metrics for evaluating sources of overhead in parallel programs,
referred to as lost cycles. These metrics, and our entire approach, draw on existing work in
both performance evaluation and scalability analysis.

We describe a measurement and evaluation tool set based on our metrics that combines
the advantages of empirical performance measurement with the predictive power of analytic
performance modeling. Although our focus is on demonstrating the utility of our approach
and tools when solving the problems that arise in selecting among alternative
implementations, we also show that the tools can be used to predict large amounts of performance
data based on a small number of measurements.


2  Solving  the  Best  Implementation  Problem 

2.1  An  Example  Problem 

An example that demonstrates the difficulties posed by multiple implementations is
parallelizing an algorithm for the subgraph isomorphism problem. Given two graphs, one small
and one large, the subgraph isomorphism problem is to find one or more isomorphisms from
the small graph to arbitrary subgraphs of the large graph. An isomorphism is a mapping
from each vertex in the small graph to a unique vertex in the large graph, such that if two
vertices are connected by an edge in the small graph, then their corresponding vertices in the
large graph are also connected. The basic algorithm we use organizes possible solutions into
a tree, and searches the tree for actual solutions. Subgraph isomorphism is NP-complete,
but by applying filters at each node of the search tree, large portions of the search space
can often be pruned, allowing solutions to be found in a reasonable amount of time.
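
As an illustration of the tree search just described, the following is a minimal serial sketch, not the program studied in this paper. The adjacency-set representation, the name `find_isomorphisms`, and the single adjacency-consistency filter are all assumptions made for the example; the actual filters are not specified here.

```python
def find_isomorphisms(small, large, limit=1):
    """Depth-first search of the tree of partial mappings.

    small, large: dicts mapping each vertex to the set of its neighbours
    (undirected graphs; a hypothetical representation for this sketch).
    Returns up to `limit` mappings from small-graph vertices to distinct
    large-graph vertices that preserve every small-graph edge.
    """
    order = list(small)   # one small-graph vertex is assigned per tree level
    solutions = []

    def consistent(mapping, u, v):
        # Filter: each already-mapped neighbour of u must map to a neighbour of v.
        return all(mapping[w] in large[v] for w in small[u] if w in mapping)

    def search(depth, mapping, used):
        if len(solutions) >= limit:
            return                      # enough solutions were requested
        if depth == len(order):
            solutions.append(dict(mapping))
            return
        u = order[depth]
        for v in large:
            if v not in used and consistent(mapping, u, v):
                mapping[u] = v
                used.add(v)
                search(depth + 1, mapping, used)   # descend into the subtree
                del mapping[u]
                used.remove(v)

    search(0, {}, set())
    return solutions
```

Rejecting a candidate vertex at a shallow level discards the entire subtree beneath it, which is the pruning effect attributed to the filters.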

This algorithm has a number of potential parallelizations including:

Tree parallelism searches subtrees of the root node in parallel;

Loop parallelism parallelizes the loops within each filter (since all the filters work by
iterating over the nodes in each graph);

Instruction parallelism packs graph connectivity data as bitmaps into words, allowing
set intersection operations to be implemented as Boolean operations within the filter
loops.
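
The bitmap idea behind instruction parallelism can be sketched as follows. This is an illustrative assumption about the encoding (one bit per vertex number), and the helper names `to_bitmap` and `from_bitmap` are hypothetical. Python integers are arbitrary-precision, so the sketch does not model word size; on a real machine each Boolean operation would cover one word's worth of vertices.

```python
def to_bitmap(neighbours):
    """Pack a set of small non-negative vertex numbers into one integer."""
    bits = 0
    for v in neighbours:
        bits |= 1 << v
    return bits


def from_bitmap(bits):
    """Unpack a bitmap back into the set of vertex numbers it encodes."""
    return {i for i in range(bits.bit_length()) if (bits >> i) & 1}


# Hypothetical adjacency sets of two vertices in the large graph.
adj_a = to_bitmap({1, 2, 5, 7})
adj_b = to_bitmap({2, 3, 5, 6})

# A single AND computes the set intersection that would otherwise need a loop.
common = adj_a & adj_b
```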

We  have  created  a  program  to  solve  subgraph  isomorphism  that  can  implement  tree,  loop, 
and  instruction  parallelism,  in  any  combination.  In  this  paper,  we  refer  to  these  variations  in 
the  structure  of  a  program,  such  as  different  algorithms,  different  parallelizations,  different 
task schedules, and different synchronization methods, as different implementations.^1 Thus
our  program  incorporates  8  different  implementations. 

Our program runs on 7 shared-memory multiprocessors: the Sequent Balance, the
Sequent Symmetry, the Silicon Graphics Iris, the BBN Butterfly, the BBN TC2000, the IBM
8CE, and the KSR1. All of these machines have at least 8 processors; on some machines
we used as many as 32 processors.

Input  to  the  program  consists  of  two  graphs,  generated  randomly.  Based  on  the  random 
process  used  to  construct  the  two  graphs,  we  can  estimate  the  probability  that  any  given 
leaf  node  in  the  search  tree  represents  a  valid  isomorphism,  which  we  call  the  density  of  the 
solution  space.  When  the  small  graph  has  few  edges  and  the  large  graph  has  many  edges, 
the  solution  space  is  dense;  when  the  small  graph  has  many  edges  and  the  large  graph  has 
few  edges,  the  solution  space  is  sparse. 

The  program  can  search  for  any  number  of  isomorphisms;  in  our  experiments  we  vary 
the  number  of  solutions  requested  from  1  to  256.  We  refer  to  this  as  varying  the  problem, 
since  the  implementation  and  input  are  fixed. 

Different  parallelizations  have  widely  differing  performance  as  a  function  of  machine, 
number of processors, input, and problem. The performance of each parallelization is a
function whose domain is this 4-dimensional space. Problems that require selecting among the
various parallelizations can come in many forms:

• for a fixed machine, number of processors, and problem, we may need the best
parallelization as we vary the input density;

• for a fixed machine, number of processors, and input density, we may need the best
parallelization as we vary the problem;

• for a fixed machine, input density, and problem, we may need the best parallelization
as we vary the number of processors; and

^1 Different implementations could even include the use of different runtime libraries or various compiler
optimizations.

Varying:

Fixed: 

Machine 

128 solns, sparse

Density 

Butterfly,  1  soln 

Problem 

Symmetry,  dense 

8CE  Iris  Symm.  KSR1

1  128  256 

Loop 

Tree

212  2JIfi  29.7  10.7 

36.1  2.60  IM 

Qja  33.7  541.5  1.77 

2.33  3.76  8.00  1.49 

0.32  1.31  2.32 

1.32  1.67  LSQ 

Table 1: Comparison of Loop and Tree Parallelism in Varying Environment


Figure  1:  Comparison  of  Four  Parallelizations,  Varying  Processors 

• we may need the best parallelization for a fixed input density and problem as we port
the program across machines.

One might assume that for some of these cases, the best parallelization does not vary,
making the decision easy. In fact, we show in a detailed study of this application [Crowl
et al., 1993] that in none of these cases is the best parallelization fixed; the choice of which
parallelization of subgraph isomorphism performs best varies in all cases by significant
margins. Other researchers have also noted that the best parallelization for a given problem
can vary depending on the input, machine, or problem [Eager and Zahorjan, 1993; Rao and
Kumar, 1989; Subhlok et al., 1993].

Examples of these effects are shown in Table 1. This table shows the best running time
in seconds for loop- and tree-parallel implementations, while varying one component of the
environment. The underlined entries in the table are the better-performing executions. The
table shows that: 1) when we seek 128 solutions in a sparse solution space, some machines
prefer loop parallelism, while others prefer tree parallelism; 2) when we seek 1 solution on
the Butterfly, as the density of the solution space varies from 10~^ (dense) to 10~^ (sparse)
and finally to an empty solution space, the best parallelization varies; and 3) in searching a
dense solution space on the Symmetry, loop parallelism is preferable when seeking 1 or 128
solutions, but tree parallelism is preferable when seeking 256 solutions.

An example of how the best parallelization varies as we vary the number of processors
used is shown in Figure 1. This figure shows the running time of four parallelizations (tree,
tree combined with instruction, loop, and loop combined with instruction) on the Silicon
Graphics Iris. It shows that for some numbers of processors (1, and 4-8), tree parallelism
outperforms the others; for other ranges of processors (2-3) loop combined with instruction
parallelism outperforms the others; and neither loop parallelism nor tree combined with
instruction parallelism ever performs better than all the others. It also shows that adding
instruction-level parallelism to loop parallelism improves it significantly, while adding
instruction-level parallelism to tree parallelism has a slightly negative effect on performance.

These effects can be explained in terms of load imbalance, communication costs,
processor speed, pure computation costs, and speculative computation (parallelism used to
discover alternative, cheaper solutions rather than speeding the discovery of a particular
solution). Detailed explanations are presented in [Crowl et al., 1993]; here we note only
that the insight necessary to explain the relative performance of these implementations can
be derived almost exclusively from high-level notions such as load imbalance and
communication costs, without making detailed measurements of each execution.

We will call the four dimensions of the domain of an implementation’s performance
function (input density, problem, number of processors, and machine) environment variables.
An  execution  consists  of  running  an  implementation  for  a  fixed  set  of  environment  variables. 
Additional  environment  variables  exist  for  other  programs;  for  example,  in  many  problems 
the  size  of  the  input  data,  rather  than  its  internal  structure  (e.g.,  solution  space  density), 
is  the  most  important  factor  in  application  performance. 


2.2  Performance Tuning By Selecting Implementations

Most often, parallel performance evaluation and tuning is concerned with the performance
of a single execution of an application. In contrast to tuning an application by focusing on
specific code segments to improve, our study of subgraph isomorphism indicates that there
is a prior, more general performance tuning problem: under what circumstances is each
implementation best? The previous section gave typical examples of these performance
tuning problems: varying input density or size, problem, number of processors, or machine.
Each of these problems corresponds to searching a subspace of the environment space. The
previous section also showed that no dimension of the environment is trivial. To be able
to solve all of these performance tuning problems by selecting the proper implementation
in each case requires knowing the relative performance of each implementation over the
entire, high-dimensional environment. We call this discovery of the relative performance of
all implementations the best implementation problem. We view the best implementation
problem as a necessary precursor to the traditional style of performance tuning that focuses
on improving individual (often serial) code segments.

One way to solve the best implementation problem is to measure the performance of all
available implementations over the entire environment space. Unfortunately this solution is
exponential in the number of environmental dimensions. Thus, even for simple programs, it
is a huge task. In our study of subgraph isomorphism we measured over 37,000 executions,
while solving only a subset of the best implementation problem. We need some way to
reduce the complexity of the problem.

In many cases we can solve the best implementation problem using methods that are only
linear in the number of dimensions. The key insight lies in the explanatory power of simple
overhead categories like those used to explain the performance of subgraph isomorphism:
load imbalance, communication costs, and wasted (speculative) computation. Measuring
these overheads directly for the entire environment space is still impractical, but if the
categories are chosen properly, modeling them as a separate function of each environment
variable is feasible. A small number of measurements for each environment variable will
then suffice to parameterize the models, leading to an aggregate model of performance
prediction spanning the entire environment space.
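
The difference between the exponential and linear costs can be made concrete with a small calculation. The sampling plan below (4 environment variables, 8 settings of each) is purely hypothetical and is not the plan used in our study; the function names are likewise illustrative.

```python
def exhaustive_runs(points_per_variable, num_variables):
    """Executions needed to sample every combination of settings."""
    return points_per_variable ** num_variables


def per_variable_runs(points_per_variable, num_variables):
    """Executions needed when each variable is sampled on its own."""
    return points_per_variable * num_variables


# Hypothetical plan: 4 environment variables, 8 settings of each,
# counted for a single implementation.
print(exhaustive_runs(8, 4))    # 4096
print(per_variable_runs(8, 4))  # 32
```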

Given  performance  models  for  each  implementation  that  span  the  entire  environment 
space,  the  selection  of  best  implementation  is  straightforward.  When  an  application  is  to 
be  ported,  or  run  on  a  different  kind  of  data  set,  or  run  on  a  different  number  of  processors, 
the  performance  models  can  be  quickly  reduced  to  functions  of  the  environment  variable(s) 
of  interest.  The  crossover  boundaries  at  which  one  implementation  outperforms  another 
are  then  obtained  by  directly  solving  the  performance  functions  as  simultaneous  equations. 
In the context of a parallel programming environment these performance models would be
associated with their implementations, for ready use as implementation-selection decisions
arise.
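
As a sketch of how a crossover boundary might be obtained, suppose the models for two implementations have each been reduced to a linear function of one environment variable. The coefficients, the variable (input size n), and the function names below are hypothetical; models of other forms would need a different solver.

```python
def linear_crossover(a0, a1, b0, b1):
    """Return x where a0 + a1*x equals b0 + b1*x, or None if the lines are parallel."""
    if a1 == b1:
        return None
    return (b0 - a0) / (a1 - b1)


def best_implementation(models, x):
    """Name of the model that predicts the smallest running time at x."""
    return min(models, key=lambda name: models[name](x))


# Hypothetical fitted models: running time in seconds versus input size n.
models = {
    "A": lambda n: 2.0 + 0.010 * n,   # higher fixed cost, cheaper per item
    "B": lambda n: 0.5 + 0.025 * n,   # lower fixed cost, costlier per item
}

# The two predictions are equal near n = 100; B is faster below it, A above.
crossover = linear_crossover(2.0, 0.010, 0.5, 0.025)
```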

This approach is in sharp contrast to traditional performance debugging, which is
usually an "iterative task alternating between measuring and modifying the performance of
successive computation prototypes" [Lehr et al., 1989]. Our approach is complementary to
"measure-modify" performance debugging, but our problem is one that requires modeling
to play a more prominent role. If programmers are to explore a variety of vastly different
implementations before beginning to fine-tune the implementation with the best high-level
structure for a given environment, we must find a way to minimize the role of measurement,
and exploit the fact that many differences in high-level program structure are amenable to
analysis.

2.3  Lost  Cycles 

In  order  for  models  based  on  simple  overhead  categories  to  be  useful,  all  categories  must 
be measured using the same metric. We call this metric lost cycles. Lost cycles are simply
aggregate seconds of parallel overhead, attributed to various categories. Lost cycles is an
important notion because it allows us to quantitatively study tradeoffs among effects such
as serial fraction, synchronization, communication, and contention that are often measured
and modeled in incompatible ways. The portion of the execution time not consumed in lost
cycles we refer to as pure computation.

The core of our approach is the proper selection of categories. To be successful, lost
cycles must be allocated to a set of categories that together meet three criteria:

1. Completeness. The categories must capture all sources of overhead.

2. Orthogonality. The categories must be mutually exclusive.

3. Meaning. The categories must correspond to states of the execution that are
meaningful for analysis.

Although often overlooked, completeness is a crucial criterion. Completeness ensures
that we do not ignore any overheads as we vary environment variables, regardless of whether
we expect them to be dominant. Completeness is rarely achieved in performance
measurement tools; many tools concentrate on specific performance metrics such as cache miss rates,
message traffic, and execution profiles. Each of these tools is useful as long as the tool’s
metric corresponds to a dominant source of overhead. However, predicting which category
of overhead is dominant in all cases is a very difficult task; it is not uncommon that
performance is dominated by unexpected effects.

Completeness is also rarely achieved because it requires measurement of effects that
occur at different levels: application (e.g., load imbalance) and hardware (e.g., resource
contention). A system that attempts to measure lost cycles therefore must be able to
instrument the application, as well as have access to hardware performance data.

Completeness and orthogonality together ensure that we can correctly measure lost
cycles, and indirectly, pure computation. Completeness ensures that measurements of pure
computation are accurate. Orthogonality ensures that we can subtract overheads from
running time to calculate pure computation in the natural way.

Meaningfulness of categories serves a rather different purpose, and makes the choice
of categories somewhat more difficult. Categories must be meaningful so that they are
likely to be amenable to simple analysis. It is certainly possible to define a set of overhead
categories that are complete and orthogonal, but without meaning for analytic purposes;
the simplest such set would contain one category for all lost cycles. Thus the challenge
in defining a category set is in dividing overheads finely enough that they can be analyzed
simply, but not so finely as to present problems in measurement, or in verifying completeness
and orthogonality.

Meaningfulness  of  categories  also  allows  the  programmer  to  relate  the  measurements 
made  to  the  program  being  studied.  Categories  that  are  amenable  to  analysis  tend  to 
correspond  in  simple  ways  to  the  structure  of  the  program.  As  a  result,  measurements 
of meaningful categories can provide significant performance tuning assistance even apart
from their use in analytic models.

2.4  Related  Work 

The majority of the tools and metrics devised for performance evaluation and tuning reflect
their orientation toward the measure-modify paradigm [Callahan et al., 1990; Cybenko et al.,
1991; Davis and Hennessy, 1988; Dongarra et al., 1990; Goldberg and Hennessy, 1993;
Heath and Etheridge, 1991; Kohn and Williams, 1993; So et al., 1987]. Such tools are very
useful in application fine-tuning, but usually do not provide completeness (i.e., they don’t
measure all sources of overhead in the execution). As a result, they are limited to cases in
which the principal overheads are known in advance.

A number of researchers in the parallel performance evaluation and tuning community
have focused on measurement of multiple parallel overheads. In particular the PEM system
has developed a taxonomy of parallel overheads similar to ours [Burkhart and Millen, 1989],
and Quartz and MemSpy together can measure the overhead categories we use [Anderson
and Lazowska, 1990; Martonosi et al., 1992]. However, since these (and similar tool sets)
are not oriented toward performance prediction of alternative implementations, the specific
overhead categories they use are not always amenable to easy analysis. In addition, the
completeness criterion has not been emphasized in most previous overhead measurement
work [Burkhart and Millen, 1989; Møller-Nielsen and Staunstrup, 1987; Tsuei and Vernon,
1990; Vrsalovic et al., 1988].

A common method of modeling overheads in parallel programs is scalability analysis
[Kumar and Gupta, 1991]. Scalability analysis develops analytic, asymptotic models of
computation and selected overhead categories as a function of the size of the problem n
and the number of processors p. These analyses provide insight into the inherent scalability
of a particular application and machine combination. Such analyses are an important part
of our approach, but they don’t provide enough mechanism on their own to solve the best
implementation problem. Most importantly, they are subject to the problem of deciding
beforehand which overheads will dominate, which can be error-prone. In addition, they
cannot be used directly for performance prediction or for a comparison of alternative
implementations in general because of their reliance on asymptotic analysis, and on constants
which must be determined experimentally.

In addition to scalability analysis, many other analytic frameworks are based on the
notion of parallel overheads [Carmona and Rice, 1991; Eager et al., 1989; Flatt and Kennedy,
1989; Nicol and Willard, 1988]. The analytic portion of our work is compatible with these
other frameworks as well as with scalability analysis, so we will use that term to include all
analytic frameworks that deal expressly with the modeling of parallel overheads.

Other methods model overheads in highly empirical ways that are difficult to generalize
[Abrams et al., 1992; Dimpsey and Iyer, 1991]. These methods develop categories of
execution state based solely on program events; as a result it is difficult to predict how time
spent in these categories would change with changes in the environment.

Finally,  the  notion  of  selecting  alternative  implementations  based  on  the  environment  is 
present in the ISSOS system [Schwan et al., 1988]. The emphasis in that work is on selecting
alternative  implementations  dynamically  based  on  ongoing  system  monitoring.  As  a  result 
the  dynamic  adaptations  do  not  include  drastic  restructuring  of  the  application,  and  no 
guidelines  for  how  the  programmer  should  select  among  alternatives  were  developed. 

3  Identifying  and  Measuring  Lost  Cycles 

The  lost  cycles  approach  consists  of  the  following  steps: 

1. programs under study are instrumented transparently;

2. at run time, the programs are measured to determine lost cycles, and those
measurements are partitioned into a small number of categories;

3. lost cycles data is accumulated for a range of environment variations (varying the
number of processors, size of input, structure of input, or machine characteristics);

4. data for each measurement category is fitted to an appropriate analytic model, a task
made simple by the specific categories used; and

5. the resulting category-specific models are aggregated into an overall model of how the
program’s performance varies depending on the environment.

Lost cycles analysis bridges the gap between scalability analysis and performance evaluation by
combining the best of both approaches. By basing analysis on empirically measured
overheads, lost cycles analysis provides "hard numbers" for use in scalability models. In addition,
because the overhead categories chosen have reasonably well-understood properties, choosing
the proper model to fit the data is fairly straightforward. On the other hand, by adding
an analytic basis to performance evaluation, we extend performance evaluation to allow its
application to cross-execution problems such as the best implementation problem.


3.1  Categories of Lost Cycles


To predict performance, we must be able to predict two quantities: total lost cycles $T_o$
and pure computation $T_c$, as functions of all environment variables. Given $T_o$ and $T_c$,
we can predict running time as $T_p = (T_o + T_c)/p$. As long as the parallel algorithm is a
parallelization of the best serial algorithm, pure computation is equal to serial running time.
In that case, we can calculate efficiency and speedup as

$$E = \frac{1}{1 + T_o/T_c}, \qquad S = \frac{p}{1 + T_o/T_c}.$$

We measure $T_o$ by breaking it down into additive categories. Once we have an accurate
measurement for $T_o$ we can obtain $T_c$: $T_c = pT_p - T_o$. The category set we use in this paper
is:


Load Imbalance: processor cycles spent idling, while unfinished parallel work exists.

Insufficient Parallelism: processor cycles spent idling, while no unfinished parallel work
exists.

Synchronization Loss: processor cycles spent acquiring a lock, or waiting in a barrier.

Communication Loss: processor cycles spent waiting while data moves through the
system.

Resource Contention: processor cycles spent waiting for access to a shared hardware
resource.
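
The accounting implied by these definitions can be sketched as follows. The measurement values are invented for illustration, and the function names are hypothetical. The efficiency expression follows from the definitions above: with running time equal to total lost cycles plus pure computation, divided by the number of processors, efficiency is pure computation divided by that sum.

```python
def total_lost_cycles(lost):
    """Sum the per-category lost cycles (aggregate seconds over all processors)."""
    return sum(lost.values())


def pure_computation(p, t_p, lost):
    """Pure computation: processors times running time, minus total lost cycles."""
    return p * t_p - total_lost_cycles(lost)


def efficiency(t_c, t_o):
    """Efficiency 1 / (1 + T_o/T_c), equivalently T_c / (T_o + T_c)."""
    return 1.0 / (1.0 + t_o / t_c)


def speedup(p, t_c, t_o):
    """Speedup is p times the efficiency."""
    return p * efficiency(t_c, t_o)


# Hypothetical measurement: 8 processors, 10 s running time, and
# aggregate seconds of overhead in each category.
lost = {
    "load imbalance": 6.0,
    "insufficient parallelism": 2.0,
    "synchronization loss": 4.0,
    "communication loss": 5.0,
    "resource contention": 3.0,
}
t_o = total_lost_cycles(lost)          # 20 s lost in aggregate
t_c = pure_computation(8, 10.0, lost)  # 80 - 20 = 60 s of pure computation
```

With these numbers the efficiency is 0.75 and the speedup is 6. The subtraction is only valid if the categories are complete and orthogonal, which is why those two criteria matter for measurement.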


This  category  set  has  proven  to  satisfy  the  three  criteria  (completeness,  orthogonality, 
meaning)  for  the  applications  we  have  studied.  Of  course,  it  will  need  to  be  expanded  to 
handle  a  wider  range  of  overheads  as  it  is  used  in  more  varied  situations.  In  particular, 
it  does  not  currently  distinguish  between  synchronization  types,  measure  contention  for 
software resources, or measure operating system and runtime library effects. Each of these
extensions appears to be straightforward within the existing framework, however.

3.2  A Tool For Analyzing Lost Cycles


Although we performed model selection by hand for the examples in this paper, we anticipate
the need for a tool that manages performance data and focuses the user on a selection
of  appropriate  models  for  each  category  of  lost  cycles.  Such  a  tool  assists  the  user  in  two 
ways: 

1.  The  analysis  tool  stores  and  presents  lost  cycles  data,  and  provides  the  ability  to 
quickly  explore  different  models  for  each  variable. 

2. The analysis tool guides the user’s selection of models for each category and environment
variable, using preferences based on our experience in modeling the categories
of  lost  cycles.  For  example,  when  varying  data  set  size,  the  model  for  communication 
loss  defaults  to  a  linear  function  of  data  size.  Likewise,  when  varying  the  number  of 
processors,  the  model  for  insufficient  parallelism  loss  defaults  to  a  linear  function  of 
number  of  processors. 

Analyzing  lost  cycles  is  done  independently  for  each  environment  variable.  For  each 
variable, a small number of data points are taken. Generally, the programmer should select
a number of data points that will provide enough data to get accurate models of all of
the overheads. The simple models we use rarely require more than two data points to
parameterize, but additional data points may be taken for two reasons: to give insight into
the type of model to use, and to eliminate noise in the data and get a more accurate model
fit (e.g., using a least squares fit).
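
As an illustration of parameterizing a model that has a single associated constant, the following sketch fits y ≈ c·f(x) by least squares. The function name `fit_constant` and the sample measurements are hypothetical and are not taken from the tool described here.

```python
from typing import Callable, Sequence


def fit_constant(basis: Callable[[float], float],
                 xs: Sequence[float],
                 ys: Sequence[float]) -> float:
    """Return the least-squares constant c for the model y ~= c * basis(x)."""
    fs = [basis(x) for x in xs]
    denom = sum(f * f for f in fs)
    if denom == 0:
        raise ValueError("basis function is zero at every data point")
    # Closed-form solution of min_c sum((c * f_i - y_i)^2).
    return sum(f * y for f, y in zip(fs, ys)) / denom


# Hypothetical measurements: communication loss (seconds) at three data
# sizes, modeled as a linear function of data size.
sizes = [100.0, 200.0, 400.0]
loss = [1.0, 2.1, 3.9]
c = fit_constant(lambda n: n, sizes, loss)
```

With exactly one data point the fit reduces to solving for the constant directly; extra points only average out noise, which is the second reason given above for taking them.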

An  example  we  will  present  in  Section  4  is  somewhat  extreme  because  it  only  uses  two 
data  points  for  the  complete  analysis.  However,  we  don’t  expect  that  a  large  number  of 
data  points  are  necessary  in  any  case,  and  the  number  needed  doesn’t  grow  faster  than  the 
number  of  environment  variables  being  considered. 

In many cases it will be necessary to perform simple analysis on the program to ascertain
the proper models. The examples in Section 4 exhibit several cases where this occurs. We
expect that in most cases this analysis will be straightforward.

3.3  A  Tool  For  Measuring  Lost  Cycles 

In  previous  work  [Crovella  and  LeBlanc,  1993]  we  showed  that  basing  measurement  on 
logical  expressions  that  recognize  lost  cycles  is  a  particularly  useful  approach.  We  call  these 
expressions  performance  predicates.  The  use  of  performance  predicates  to  specify  categories 
of  lost  cycles  makes  program  instrumentation  straightforward,  and  allows  predicate  profiles 
to  be  constructed  based  on  user  demands.  For  example,  using  predicate  profiling,  the  user 
can  ask  for  a  breakdown  of  lost  cycles  by  processor  number,  task,  or  procedure.  The  current 
implementation  uses  predicate  profiling  of  event  logs,  rather  than  the  runtime  profiling  used 
in  our  earlier  work. 

Our current implementation of the lost cycles measurement tool measures Fortran programs
running on the Kendall Square KSR1, and takes the form of a library linked into the
executable code. The KSR Fortran runtime system can log events such as the start and end
of individual loop iterations, which we use for calculating load imbalance. Additional calls
to our library routines are inserted at the start and end of parallel loops, parallel tasks, and
synchronization operations. The inserted library calls are quite simple and could easily be
added by a source-to-source preprocessor.

The KSR1 [Research, 1991] is a two-level ring architecture in which all memory is
managed as a cache, which is organized in two levels on each node. Thus, inter-node
communication occurs only as the result of misses in the secondary cache. Dedicated hardware
monitors the state of buses between the processor and the second-level cache. This
performance monitor counts the number of secondary cache misses, the time taken to service
secondary cache misses, and the number of cache lines that passed through the higher-level
ring before arrival. Based on this data, we can calculate the amount of communication
performed in an execution and the amount of resource contention that occurred.

Communication  loss  is  measured  as  a  simple  product  of  the  number  of  cache  misses  and 
the  ideal  time  to  perform  the  cache  line  transfers.  Resource  contention  (contention  for  the 
ring  interconnect  and  for  remote  memories)  is  measured  as  in  [Tsuei  and  Vernon,  1990]  — 
that  is,  the  ideal  time  to  perform  the  communication  operations  is  compared  to  the  actual 
elapsed time. Since the KSR1 hardware monitors both the number of cache lines transferred
and the elapsed time waiting for cache lines, this calculation is straightforward. Although
the  performance  monitoring  hardware  on  the  KSR  is  rather  unique,  something  comparable 
may  be  required  for  other  cache-coherent  architectures.  On  simpler  architectures,  such 
as  a  message-passing  system,  the  performance  monitoring  capabilities  of  the  DEC  Alpha 
[Corporation,  1992]  should  be  sufficient  to  gather  the  same  information. 
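
The two calculations described above can be sketched as follows. This is a minimal illustration, not the library's code: the function names, the per-line ideal transfer time, and the choice to report contention as the (non-negative) difference between elapsed and ideal time are all assumptions made for the example.

```python
def communication_loss(misses: int, ideal_transfer_time: float) -> float:
    """Communication loss: number of cache misses times the ideal time
    to transfer one cache line (seconds)."""
    return misses * ideal_transfer_time


def resource_contention(misses: int,
                        ideal_transfer_time: float,
                        elapsed_miss_time: float) -> float:
    """Resource contention, assumed here to be the elapsed time spent
    waiting for cache lines beyond the ideal transfer time."""
    ideal = communication_loss(misses, ideal_transfer_time)
    # Clamp at zero so measurement noise cannot yield negative contention
    # (an assumption of this sketch).
    return max(0.0, elapsed_miss_time - ideal)


# Hypothetical counter values for one execution.
cl = communication_loss(1000, 2e-6)
rc = resource_contention(1000, 2e-6, 0.005)
```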

4  Cross-Execution  Performance  Evaluation 

This section presents three examples of the use of lost cycles analysis. First we show how
to construct accurate analytic models of program scalability based on a small number of
measurements. Then we present two examples that illustrate how lost cycles analysis can
help solve the problem of selecting the best parallelization for a program.

4.1  Modelling  the  Performance  of  2D  FFT 

The ability to capture the expected performance of a program based on a small number of
measurements is critical to managing the problem of understanding and selecting among
differing implementations. Measuring and debugging program performance without gathering
large amounts of data is an important capability in its own right, and is the subject of much
current effort [Ball and Larus, 1992; Hollingsworth and Miller, 1993; Miller and Choi, 1988;
Netzer and Miller, 1992]. The results in this section show that in addition to its use in
solving the best implementation problem, lost cycles modelling is a convenient way of
capturing large amounts of performance data, requiring minimal measurement effort and little
storage.

In this section we study the scalability of a two-dimensional discrete Fourier transform
program (2D FFT). This program is simple enough to demonstrate our method, while
exhibiting a wide variety of performance effects. Our implementation of the program consists
Category                  Abbrev.  Model
                                   Varying n     Varying p
Pure Computation          PC       n^2 log(n)    1
Load Imbalance            LI       n log(n)      p sqrt(p)
Insufficient Parallelism  IP       1             p
Synchronization Loss      SL       0             0
Communication Loss        CL                     p
Resource Contention       RC       n^2           p, for p above a threshold

Table 2: Models of Overhead, 2D FFT, as a Function of n and p


of a number of iterations in which all processors first participate in 1D FFTs on columns of
the input matrix, then transpose the matrix in parallel and perform 1D FFTs on the rows
of the matrix. Each iteration of the program consists of 5 parallel loops: one to initialize
the matrix, one to perform the column-wise FFTs, two to transpose the matrix (using an
intermediate matrix), and one to perform the row-wise FFTs. Our study will result in the
ability to quickly characterize the program’s execution time and efficiency over a range of
dataset sizes (from 32 x 32 points to 1024 x 1024 points) and numbers of processors (2 to
26), while requiring that we measure only a small number of executions.

In  this  example  we  will: 

1.  Measure the program’s lost cycles for two cases: the highest-overhead case, and the
lowest-overhead case.

2.  Select appropriate simple models for each category of lost cycles and for pure
computation, as separate functions of varying data size and varying number of processors.

3.  Use  the  measurements  made  in  step  1  to  parameterize  the  models  selected  in  step  2, 
yielding  predictions  for  running  time  over  the  entire  range  of  data  sizes  and  numbers 
of  processors. 

We selected two data points for measurement because by measuring an execution with
high relative overhead we can get an accurate estimate of true overhead, and by measuring
an execution with low relative overhead we can get an accurate estimate of pure
computation. Rules of scalability analysis guide us in selecting the data points for measurement:
in a parallel system, overheads tend to grow with increasing processors and decrease with
increasing data size. These observations suggest that we should capture lost cycles for
an execution with maximum processors and minimum data (highest overhead) and for an
execution with minimum processors and maximum data (lowest overhead).
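
The selection rule above can be written down directly. In this sketch the helper name `extreme_points` is hypothetical, and the grid (six power-of-two matrix sides, thirteen even processor counts) is an assumption chosen only because it matches the stated ranges and the total of 78 points reported later in this section.

```python
from typing import Iterable, Tuple


def extreme_points(sizes: Iterable[int],
                   procs: Iterable[int]) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return ((n, p) with highest relative overhead,
               (n, p) with lowest relative overhead)."""
    sizes = list(sizes)
    procs = list(procs)
    highest_overhead = (min(sizes), max(procs))  # least data, most processors
    lowest_overhead = (max(sizes), min(procs))   # most data, fewest processors
    return highest_overhead, lowest_overhead


# Assumed measurement grid (not stated explicitly in the text).
sizes = [32, 64, 128, 256, 512, 1024]
procs = list(range(2, 27, 2))
high, low = extreme_points(sizes, procs)
```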

The simple models we chose to describe each overhead are listed in Table 2, as separate
functions of n (the length of a side of the input matrix) and p (number of processors). Each
model has an implicitly associated constant; the purpose of our lost cycles measurements in
step 1 is to discover the constants. Each of these models is a simple, initial approximation
to reality. Better models for each are possible, but not necessary in this context since
they trade increasing accuracy for increasing measurement cost and decreasing analytic
tractability.

Figure 2: Predicted and Actual Performance of 2D FFT

Considering first the models for varying n, the model for pure computation is based on
simple algorithmic analysis of 2D FFT. The model for load imbalance is based on the length
of an iteration of the program’s parallel loops. There are no synchronization operations in
the program, so we expect no synchronization loss. The model for insufficient parallelism is
based on the portion of the code that runs serially, which has no data size dependencies.
The model for communication loss is based on the total amount of data used. Finally, the
model for resource contention is based on the expectation that resource contention will be
proportional to data size.

In choosing models of overhead as we vary the number of processors, we can rely on
the large body of work in the literature to provide likely candidate models. Most of the
models we use are straightforward: pure computation does not vary as we vary processors,
insufficient parallelism obeys Amdahl’s Law [Amdahl, 1967], and synchronization loss is
zero.

Load imbalance can arise in two ways: variation in the running time of each loop
iteration, and unequal numbers of loop iterations handled by different processors. If variation in
running time of iterations is random, the time taken by the longest iteration can be modeled
using order statistics (e.g., [Hummel et al., 1992]) and predicted to grow with the number of
processors (see the load imbalance entry in Table 2). Communication loss can be difficult to
model, but for this simple application is proportional to p. Finally, resource contention can
be expected to grow linearly once the number of processors passes a threshold value.

Using the lost cycles measurements from the two executions we then parameterize the
six models (PC, LI, IP, SL, CL, and RC). Using the basic identity $T_p = (T_c + T_o)/p$, we
construct the performance model for the implementation as:

$$T_p(n,p) = \frac{PC(n,p) + LI(n,p) + IP(n,p) + SL(n,p) + CL(n,p) + RC(n,p)}{p}$$
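
A minimal sketch of evaluating this model is given below. It assumes that each category is the product of its constant, its function of n, and its function of p; the text only says the models are chosen as separate functions of n and p, so the product form is an assumption of the sketch. The name `predicted_time` and all constants are hypothetical.

```python
import math
from typing import Callable, Dict, Tuple

# (constant, function of n, function of p) for one category.
Model = Tuple[float, Callable[[float], float], Callable[[float], float]]


def predicted_time(models: Dict[str, Model], n: float, p: int) -> float:
    """Sum pure computation and all lost cycles categories, divide by p."""
    total = sum(c * f_n(n) * g_p(p) for c, f_n, g_p in models.values())
    return total / p


# Hypothetical constants for two of the six categories.
example_models: Dict[str, Model] = {
    "PC": (1.0e-6, lambda n: n * n * math.log2(n), lambda p: 1.0),
    "IP": (1.0e-3, lambda n: 1.0, lambda p: float(p)),
}
t = predicted_time(example_models, n=256, p=8)
```

Keeping the categories in a dictionary makes it easy to swap in a different function for one category without touching the others.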

The results for the 2D FFT program are shown in Figure 2. These plots show the
performance of the application measured in Mflops, as a function of both number of processors
and of dataset size. The left hand plot shows the predictions of our model for 78 data
points, that is, all points within the range of processors and data set sizes we set out to
model. The right hand plot shows the actual measured performance of the application on
the KSR1 for those same 78 data points.

As can be seen, the model is an idealized but reasonably accurate approximation to
actual performance. In fact, the average relative error of the model with respect to the
actual performance, over all 78 points, is only 12.5%. For comparison, the average relative
error of a simple linear interpolation based on a least squares fit of the four “corner” points
is over 750%. Thus both the overall shape of the predicted performance curve and its
actual values are sufficiently accurate to allow it to be used in studying tradeoffs against an
alternative implementation, which we will do in the next section.
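
For reference, the average relative error used in this comparison can be computed as sketched below. The function name and the sample values are hypothetical; the measured data behind the 12.5% figure are not reproduced here.

```python
from typing import Sequence


def average_relative_error(predicted: Sequence[float],
                           actual: Sequence[float]) -> float:
    """Mean of |predicted - actual| / |actual| over all data points."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) / abs(a)
               for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical Mflops values at three points of the environment space.
err = average_relative_error([11.0, 18.0, 42.0], [10.0, 20.0, 40.0])
```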

4.2  Task  Parallel  vs.  Data  Parallel  2D  FFT

The 2D FFT program we used in the previous section has an alternative implementation
that uses task parallelism as well as data parallelism. In this implementation, processors
are segregated into two groups using tasking directives. One group initializes the matrix
and performs data-parallel row-wise 1D FFTs, while the other group transposes the matrix
and performs data-parallel column-wise 1D FFTs. The two tasks are pipelined so that each
one is kept busy working on separate matrices.

The performance of these two implementations on the iWarp was studied in [Subhlok
et al., 1993]. On that machine, the authors discovered that as data set sizes are varied past
a certain threshold, the choice of which implementation is best changes. For small data
sets (n < 128) the parallel tasking implementation outperformed the pure data parallel
implementation. For large data set sizes (n > 256), the purely data parallel implementation
outperformed the parallel tasking implementation. The principal reason for this effect is
that in the parallel task version, communication between tasks must pass through a single
channel of the iWarp network, while purely data parallel communication can take place
along multiple channels. For small data sizes, the larger problem granularity of parallel
tasking leads to better performance, but as problem sizes increase, intertask communication
becomes a bottleneck.

It is interesting to ask whether a similar effect would be observed when this application
is run on the KSR1, a machine with a significantly different architecture. Unfortunately,
the data from the iWarp cannot help us decide which executions to measure, since the
machines are so different. Thus we immediately run into the best implementation problem:
perhaps there is a crossover between implementations in some section of the environment
space (here, n and p), but finding the crossover would require measurements over the entire
space.
Category                  Data Parallel        Task Parallel
Pure Computation          n^2 log(n) / 3550    n^2 log(n) / 3350
Load Imbalance            n log(n) / 63.0      n log(n) / 81.9
Insufficient Parallelism  3.36                 n^2 / 992
Synchronization Loss      0
Communication Loss
Resource Contention       n^2 / 20100          n^2 / 35600

Table 3: Performance Models for Data Parallel and Task Parallel 2D FFT


To answer this question using the lost cycles approach, we only need to construct a
model for the application. The simplest approach in this case is to 1) decide whether the
category models used for the pure data parallel implementation follow the same functions;
and 2) determine new constants for the overhead functions. To do this we measured 6 points
varying the data set size (to explore the functions of n) and 6 points varying the number
of processors (to explore the functions of p).

The results are shown in Table 3. The table shows the functions and the associated
constants for the n variable, since the models did not differ significantly in the p dimension.^
These functions immediately answer our questions about these two implementations. First
of all, resource contention in the task parallel implementation is significantly less than in
the data parallel implementation, indicating that the channel bottleneck effects observed on
the iWarp will not be present on the KSR1. This conclusion is reasonable, since intra-ring
communication costs are insensitive to source and destination on the KSR1. In fact, we see
that resource contention is only about half as great in the task parallel version, since in the
pure data parallel version, all processors are simultaneously requesting and providing data
during the matrix transpose, while in the parallel task version, half the processors request
data and the other half provide it.

The second observation is that on this machine, the task parallel implementation will
always perform more poorly than the pure data parallel implementation. Synchronization
loss and insufficient parallelism are functions of n^2 in the task parallel implementation. The
reason for this change from constant values to functions of n^2 when the implementation
is changed can be seen in observing that synchronization loss is now equal to about a
third of the communication loss. In fact, in this implementation, the two tasks do not
incur equal overhead. The task that transposes the matrix incurs more overhead because
it must traverse the source matrix across cache lines, destroying locality. Thus each loop
iteration for the transposing task takes slightly longer than an iteration of the initializing
^We hold p constant at its maximum value (26) in these formulae.
Category            Processors
                    1      2      3      4      5      6      7
Wasted Speculation  0.00   50.1   103    10.7   14.2   1.25   1.44
Pure Computation    51.4   51.7          3.56   3.52   0.227  0.214

Table 4: Seconds of Pure Computation and Wasted Speculation in Subgraph Isomorphism


task; a pair of spinlocks prevents either task from overtaking the other. As a result of this
synchronization loss in the main thread of the initializing task, the other threads in its
group must wait without work, incurring lost cycles due to insufficient parallelism. This
insufficient parallelism has a particularly small constant in the denominator and hence
dominates the small improvements in resource contention and load imbalance generated by
task parallelism.^

Thus  we  have  quickly  answered  the  question  of  whether  two  implementations,  known 
to  have  a  performance  tradeoff  on  at  least  one  architecture,  have  a  similar  performance 
tradeoff  on  the  architecture  of  interest  to  us.  To  do  this,  we  only  needed  to  measure  a 
small  number  of  data  points  in  each  of  the  2  environmental  dimensions,  and  compare  the 
resulting lost cycles models.
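
Once each implementation has a fitted model, looking for a crossover no longer requires measuring the whole space; the models can simply be evaluated at every point. The sketch below illustrates this under stated assumptions: `best_implementation` is a hypothetical helper, and the two toy models are invented for the example and are not the fitted models of Table 3.

```python
from typing import Callable, Dict, Iterable, Tuple

TimeModel = Callable[[float, int], float]  # predicted time as a function of (n, p)


def best_implementation(models: Dict[str, TimeModel],
                        sizes: Iterable[float],
                        procs: Iterable[int]) -> Dict[Tuple[float, int], str]:
    """Map each (n, p) to the implementation with the smallest predicted time."""
    procs = list(procs)
    return {
        (n, p): min(models, key=lambda name: models[name](n, p))
        for n in sizes
        for p in procs
    }


# Toy models: "a" has a fixed startup cost, "b" has a higher per-element cost.
toy = {
    "a": lambda n, p: n / p + 10.0,
    "b": lambda n, p: 2.0 * n / p,
}
winners = best_implementation(toy, sizes=[5.0, 20.0], procs=[1])
```

A crossover shows up as a change of winner between neighboring points; if one name wins everywhere, as reported above for the data parallel version on the KSR1, there is no tradeoff to exploit.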

4.3  Subgraph  Isomorphism 

We now return to the example of subgraph isomorphism. Referring to Figure 1, we would
like to understand the differences among the four parallelizations of subgraph isomorphism
(tree, tree plus instruction, loop, and loop plus instruction) as a function of p. Since
we are only studying one dimension of the environment in this case, we can measure the
implementations over the entire range and use the resulting models to explain their relative
performance.

In order to achieve completeness for this application, we need to include measurement
of cycles lost in wasted computation (due to fruitless speculation). We will define wasted
computation in this case as the processor cycles spent searching a subtree under the root
in which no solutions are found. Modeling this category is difficult, but since we are only
searching a single dimension, we can simply measure lost cycles due to wasted computation
for the points of interest.

First of all, we show the wasted speculation and pure computation data for the tree
parallel case in Table 4. This table immediately explains why the tree parallel versions
outperform the loop parallel versions when p > 3, yet do worse than loop parallel versions
when 1 < p < 4. The pure computation required to solve the problem changes drastically
with increasing p because these executions of the program are searching for only one solution:
the first processor to find a solution ends the computation. Clearly, the subtrees searched
by processors 2 and 3 do not yield the solution, since wasted computation increases in steps
of 50 seconds (the time spent by processor 1 in finding a solution). Processor 4 finds a

^Presumably these effects were not present on the iWarp because of message passing optimizations.
Category                  Implementation
                          Loop               Loop + Instr.
Pure Computation          54.7               55.9
Load Imbalance            p sqrt(p) / .750   p sqrt(p) / 1.46
Insufficient Parallelism  p / .496           p / 2.08
Synchronization Loss      0                  0
Communication Loss        p / .110           p / .233

Table 5: Lost Cycles Models for Two Implementations of Subgraph Isomorphism


solution in its subtree much sooner than the others, however, and the effect is repeated
again by processor 6. The data in this table shows that even when overhead categories are
difficult to model, the raw lost cycles data can be informative in ways that are difficult for
tools that do not provide completeness.

Next, we consider why loop and instruction parallelism together outperform loop
parallelism alone. Since there is no speculative computation in the loop parallel executions, this
is done most easily by considering the lost cycles models, which are shown in Table 5.^

We might expect that the benefit of adding instruction parallelism to the implementation
would be in decreased pure computation (since instruction parallelism shows up as
decreased pure computation). However, the lost cycles models show that in fact, pure
computation is relatively unchanged between the two implementations. The model shows that
the additional cost of packing and unpacking data counterbalances the pure computational
decrease gained in parallel set operations, actually increasing pure computation slightly.

In fact, the gains from adding instruction parallelism come from a less expected direction.
First of all, by packing datasets, the overall data being transferred decreases, decreasing
communication loss by a factor of 2. Secondly, load imbalance within loops and insufficient
parallelism are decreased since these overheads both tend to increase as communication
increases.

The lost cycles models in Table 5 also explain why the loop parallel implementation
benefits from the addition of instruction parallelism, while the tree parallel implementation
actually suffers slightly from the same addition. Since the tree parallel version searches
separate subtrees in parallel, it has essentially no communication loss (this effect can be
observed from the lost cycles data as well). The absence of communication loss in the tree
parallel implementation means that it cannot reap the benefits of instruction parallelism,
whether directly in decreased communication, or indirectly through decreased load
imbalance and insufficient parallelism. Instead the tree parallel version only pays the (small)
price of instruction parallelism, a fact reflected in the performance data shown in Figure 1.


^These data are derived from an implementation of the LC tool on the Silicon Graphics Iris that did not
allow the measurement of resource contention.
In this example we have gained a number of insights into the relationship between
the particular implementation of the subgraph isomorphism program and its corresponding
performance. These insights were gained from measurements of only 12 data points, showing
the power of lost cycles analysis to provide tuning guidance, especially since these insights
were not evident from our original collection of over 37,000 data points.

5  Conclusion 

We’ve argued that an important kind of performance tuning consists of selecting among
alternative implementations of a parallel program, and that this problem suffers from
combinatorial explosion as one increases the number of environment variables considered during
tuning. To address this problem, we have presented a technique that combines performance
measurement with analytic modeling. Our technique is based on measuring overhead
categories that meet the three criteria of completeness, orthogonality, and meaning; we have
shown how each of these criteria is necessary in order to reliably solve the best
implementation problem.

Our examples show that modeling lost cycles has significant flexibility. Our first example
showed  how  modeling  lost  cycles  can  capture  a  great  deal  of  performance  data  using  only 
a  small  number  of  measurements.  Our  second  example  studied  the  best  implementation 
problem  for  two  implementations  and  used  lost  cycles  modeling  to  expose  differences  in 
the  asymptotic  behavior  of  the  two  applications.  Finally  our  last  example  showed  that 
the  raw  data  output  from  lost  cycles  measurements  can  be  informative  in  its  own  right, 
and  that  comparison  of  lost  cycles  models  can  expose  subtle  interactions  among  program 
components. 

We  are  continuing  to  develop  the  lost  cycles  approach  in  two  directions.  We  intend  to 
extend  the  set  of  categories  measurable  so  that  lost  cycles  analysis  will  be  complete  for  a 
wider  range  of  applications.  We  also  expect  to  develop  our  data  analysis  capabilities  to 
guide  the  user  to  the  best  choice  of  model  for  each  category  of  lost  cycles. 